Name: Mohamed Ayman Mohamed

ID : 900182267

In this notebook, we will explore the data and bring out its insights. I use an EDA approach to analyze the data with the help of various statistical tools and graphical techniques.


Data Exploration

1. Understanding the Data

Before visualizing the data, we need to deal with the categorical features.

1. We can notice from the above that genre, artist_name, track_name, track_id, key, and major are categorical features.

Genre (the ground truth): each genre can be assigned a unique ID, since we will not feed it into the model as a predictive feature.

2. The artist_name feature is quite problematic, as it carries important information: most artists tend to release songs of a single genre (9566 artists). However, we need to take into consideration the number of songs each artist releases. We have two options to convert this string feature into a numerical representation. Text embedding is the best option, as it can take semantics into consideration. Another option is one-hot encoding; however, we would need a powerful machine to pre-process such a thing. We cannot represent it with unique IDs, as unique IDs would add more noise to the data (we need it in higher dimensions). For the sake of simplicity, I will drop it.

3. artist_id is not necessary for the data at all.

4. key will be converted to one-hot encoding.

5. major is also a categorical feature.

To be able to visualize the data itself, I will first give major and key unique IDs. They will then be converted to one-hot encodings in the data wrangling and feature engineering section.
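A minimal sketch of that mapping, assuming a DataFrame with `key` and `major` columns like the ones in this dataset (the toy values below are illustrative, not from the real data):

```python
import pandas as pd

# Toy stand-in for the Spotify dataset (column names assumed).
df = pd.DataFrame({
    "key": ["C", "G", "C", "A", "G"],
    "major": ["Major", "Minor", "Major", "Major", "Minor"],
})

# Give each category a unique integer ID so the columns can be plotted;
# pd.factorize assigns IDs in order of first appearance.
for col in ["key", "major"]:
    df[col + "_id"], _ = pd.factorize(df[col])

print(df["key_id"].tolist())    # C→0, G→1, A→2
print(df["major_id"].tolist())  # Major→0, Minor→1
```

These integer IDs are only for plotting; the one-hot conversion happens later, since feeding raw IDs to a model would impose a fake ordering on the categories.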

2.Data Visualization

2.1 Visualize the ground truth with its class frequencies to see whether the data is balanced for further training.
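The frequency check can be sketched as follows, with a toy `genre` column standing in for the real labels:

```python
import pandas as pd

# Toy genre column standing in for the real ground-truth labels.
genres = pd.Series(["Pop", "Rock", "Pop", "Jazz", "Pop", "Rock"], name="genre")

# Frequency of each class; a bar plot of these counts
# (e.g. counts.plot.bar()) shows imbalance at a glance.
counts = genres.value_counts()
print(counts)
```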

2.2 Visualize each feature using skewness and kurtosis:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.


Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (4th moment in the moment-based calculation). That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.

(Figure: standard symmetric probability density functions with differing kurtosis.)
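Both statistics can be read off directly with pandas; a sketch on synthetic columns (an exponential one standing in for a skewed feature like duration_ms, a normal one for a symmetric feature like tempo):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A right-skewed column (exponential) vs. a roughly symmetric one (normal).
df = pd.DataFrame({
    "duration_ms": rng.exponential(scale=200_000, size=5_000),
    "tempo": rng.normal(loc=120, scale=10, size=5_000),
})

# pandas reports bias-corrected sample skewness and *excess* kurtosis
# (a normal distribution gives values near 0 for both).
print(df.skew())
print(df.kurtosis())
```

The exponential column shows large positive skew and kurtosis, while the normal column sits near zero on both, which is the pattern used below to flag heavy-tailed features.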

2.3 Visualize the box plot of each feature in univariate and multivariate analyses to help us detect the outliers

2.3.1 Univariate outliers (one-variable outlier analysis):
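The univariate rule a box plot applies is the 1.5×IQR whisker criterion; a sketch on a synthetic column with two planted outliers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 1000 well-behaved points plus two planted outliers.
x = pd.Series(np.concatenate([rng.normal(0, 1, 1000), [8.0, 9.5]]))

# The same 1.5×IQR rule a box plot uses for its whiskers.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(len(outliers))
```

Note that a few legitimate points from the tails of a normal distribution also fall outside the whiskers, which is why the text below argues against blindly deleting everything the rule flags.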

2.3.2 Multivariate outliers (more-than-one-variable outlier analysis):

2.4 Bivariate Data Analysis


2.4.1 Using Heatmaps Representing Correlations

2.4.2 Checking the correlations between continuous features
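A sketch of the computation behind the heatmap, on synthetic columns (the feature names are stand-ins; in the real data, pairs like energy and loudness are typically the correlated ones):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
energy = rng.uniform(0, 1, 2000)
df = pd.DataFrame({
    "energy": energy,
    "loudness": 10 * energy + rng.normal(0, 0.5, 2000),  # strongly tied to energy
    "tempo": rng.normal(120, 10, 2000),                  # unrelated
})

# Pearson correlation matrix; sns.heatmap(corr, annot=True) would render it.
corr = df.corr()
print(corr.round(2))
```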

2.4.3 Comparing three categorical features with respect to the genre

Checking whether two categorical variables are independent can be done with the chi-squared test of independence.

The chi-squared test of independence is most commonly used to test the association between two categorical variables. Its output gives us the p-value, the degrees of freedom, and the expected values.

The steps for using the chi-squared test: set the hypotheses, prepare the contingency table, compute the expected counts, compare the observed values with the expected values, and conclude on the hypothesis.

We know that if the p-value > 0.05, we fail to reject the null hypothesis, i.e. the two features are considered independent.
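The steps above can be sketched with `scipy.stats.chi2_contingency` on a hypothetical 2×2 contingency table (the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table of two categorical features
# (rows = categories of one feature, columns = categories of the other).
table = np.array([
    [30, 10],
    [20, 40],
])

# Returns the test statistic, p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(table)
print(round(p_value, 4), dof)
# Here p < 0.05, so we reject the null hypothesis of independence.
```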

Popularity vs. artist name

From visualizing the data, we can notice that:

1. There is no correlation between tempo and the other features.

2. The correlations between categorical features are nearly 0; the p-values for all of them are > 0.05, meaning they are independent.

3. There are some strong correlations between continuous features, as we can see from the graphs drawn above.

4. We can notice a number of outliers that need to be removed to avoid noise during the training step (see the kurtosis and the IQR).

5. We can notice that the last genre, "A Capella", has only 119 examples in the data, which means the data is somewhat imbalanced. There are basic solutions such as oversampling and undersampling; however, these lead to other problems. I need to do thorough research to find better solutions than deleting examples.

We can conclude that

1. tempo, artist_id, track_name, and artist_name should be dropped, as explained above.

2. All outliers in the input features should be removed.

3. The data features will need to be rescaled.

4. The categorical features will be converted to one-hot encoding.

Data Wrangling and Feature Engineering

3.1 Dropping unnecessary features

3.2 Handling the problem of Outliers

Outliers lead to many problems, such as sensitivity to noise. We can address this with different techniques: if we delete all outliers, a large part of the data will be deleted; if we leave them in, other problems will occur.

Through research, I have found that the features whose values lie in the interval [0, 1] come from APIs, so extreme values there do not necessarily mean outliers. However, duration_ms is pointy (according to its kurtosis), so we will remove outliers based on duration_ms only.
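A sketch of that filter using the 1.5×IQR rule on duration_ms only, with synthetic data in place of the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "duration_ms": np.concatenate([rng.normal(220_000, 30_000, 1000),
                                   [1_200_000, 1_500_000]]),  # two extreme tracks
    "energy": rng.uniform(0, 1, 1002),
})

# 1.5×IQR rule applied to duration_ms only, as argued above;
# rows for all other features are kept intact.
q1, q3 = df["duration_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["duration_ms"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
print(len(df), len(df_clean))
```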

3.3 Separate the labels (ground truth) from the features and give each label a unique ID

3.4 Converting all categorical features to one hot encoding

3.5 Handling negative numbers

3.6 Scaling and Normalization

We can see from the above that, after rescaling, the skew and kurtosis approach the symmetric state.
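A sketch of the rescaling step on a synthetic skewed column. One caveat worth making explicit: min-max scaling is an affine map and leaves skewness unchanged, so if the skew actually dropped, a shape-changing transform such as a log was involved; both are shown for comparison:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
duration = pd.Series(rng.exponential(200_000, 5_000))  # right-skewed, like duration_ms

# Min-max scaling maps values into [0, 1] but preserves the shape;
# a log transform is one way to actually pull the skew down.
scaled = pd.Series(MinMaxScaler().fit_transform(duration.to_frame()).ravel())
logged = np.log1p(duration)

print(round(duration.skew(), 2), round(scaled.skew(), 2), round(logged.skew(), 2))
```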

4. Data Modeling (Clustering)

There are a number of clustering methods, such as affinity propagation, expectation maximization (EM), K-means, and hierarchical clustering. Affinity propagation and expectation maximization need a huge amount of memory to store the data. More specifically, affinity propagation creates three 2D matrices to calculate the exemplars, computing responsibility, availability, and self-availability. We also know that K-means clustering is a special case of EM; EM uses distributions (mean and variance) to form clusters (soft clustering). Generally, I will use K-means for visualizing the data, as it is the simplest in terms of complexity.

4.1 K-means clustering (hard clustering). I will use elbow analysis to get the best number of clusters.

We can see below that, according to distortion, it is better to choose the number of clusters between 10 and 15, while the inertia method suggests that 5 is best. Due to the huge number of data points, we will not be able to distinguish between them clearly. We choose the number of clusters $k$ at which the within-cluster sum of squares, $\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, stops decreasing abruptly.

4.1.1 Using Distortion

4.1.2 Using Inertia
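The inertia-based elbow can be sketched with scikit-learn on synthetic blobs (5 true clusters, so the curve should bend near k=5):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

# inertia_ is the within-cluster sum of squared distances; plotting it
# against k and looking for the "elbow" gives the candidate k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 9)}
print(inertias)
```

On real data the bend is rarely this sharp, which is why distortion and inertia can disagree, as they do above.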

Just to draw the clustering for visualization, I will take a subset of the data.
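A sketch of that subsetting-and-clustering step, again on synthetic blobs standing in for the pre-processed features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5_000, centers=5, random_state=1)

# Take a random subset so a scatter plot stays readable.
rng = np.random.default_rng(1)
subset = X[rng.choice(len(X), size=500, replace=False)]

labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(subset)
print(subset.shape, np.unique(labels))
# plt.scatter(subset[:, 0], subset[:, 1], c=labels) would draw the clusters.
```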

5. Training the Model with Pre-processed DataSet

Now the data is ready for training. Since we are only doing EDA, we do not need to train an ML model on the data. Generally, the problem is a multiclass classification problem. We can use a neural network, logistic regression, an SVM, decision trees, or any other classifier to solve it.

References:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://realpython.com/k-means-clustering-python/

https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/

https://www.kaggle.com/adisrw/spotify-data-analysis-using-python/data

https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

https://towardsdatascience.com/statistics-in-python-using-chi-square-for-feature-selection-d44f467ca745

https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365

https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de

https://t.co/rurMFoBmlY